Knowledge Discovery Through Induction with Randomization Testing
Abstract
Design

IRT embodies a view of induction as a four-phase process (shown in Figure 1). The process alters a current model by generating a group of new competitor models, fitting those competitor models to data, comparing the competitors to each other, and then testing the statistical significance of the competitors. The process is iterative: it can be repeated until no competitor can be found that is significantly better than the current model.

[Figure 1: IRT’s inductive process. Flowchart: current model, generate competitors, fit competitors, compare competitors, test significance. If any competitor is significantly better, replace the current model with that competitor; if not, either continue searching or accept the current model.]

Generating competitors creates one or more models with a different structure than the current model. Examples include creating decision trees with different attributes and structure, or classification rules with different conditions and operators. Each of these competitors is a candidate to replace the current model.

Fitting sets the free parameters internal to each competitor model. Examples include setting the split points of decision trees or the condition values of classification rules.

Comparing competitors evaluates the accuracy of each competitor in relation to the other competitor models. This is a relative comparison among competitor models, such as comparing two or more decision trees or classification rules that have different structure.

Testing models evaluates the statistical significance of the competitors, asking "Does any of them perform substantially better than would be expected by chance?" This compares the competitors to external, absolute references.

The statistical significance and raw accuracy of the competitors provide information to the investigator on whether a competitor should replace the current rule.
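The four phases described above can be sketched as a simple loop. This is only an illustrative sketch, not IRT's actual implementation: the names `induce`, `generate`, `fit`, `score`, `is_significant`, and `max_rounds` are hypothetical, and the sketch stops automatically where the text leaves the decision to the investigator.

```python
def induce(current, generate, fit, score, is_significant, max_rounds=100):
    """Iterate generate -> fit -> compare -> test until no competitor wins.

    All callables are hypothetical placeholders supplied by the caller:
      generate(model)          -> iterable of competitor structures
      fit(model)               -> competitor with its free parameters set
      score(model)             -> number, higher is better
      is_significant(best, cur) -> True if best is significantly better
    """
    for _ in range(max_rounds):
        # Phases 1 and 2: generate competitors, then fit each one.
        competitors = [fit(candidate) for candidate in generate(current)]
        if not competitors:
            break
        # Phase 3: relative comparison among the competitors.
        best = max(competitors, key=score)
        # Phase 4: test the best competitor against the current model.
        if not is_significant(best, current):
            break  # accept the current model
        current = best  # replace the current model and search again
    return current
```

In this sketch a failed significance test ends the search. In the process described in the text, the investigator may instead choose to generate additional competitors at that point.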
Based on this information, and prior beliefs, the investigator can choose to replace the current rule, to generate additional competitors, or to accept the current rule.

The process uses two disjoint samples of data. One sample is used for fitting models; the other is used for comparing and testing models. Two samples are used to assure accuracy: if models were fitted and compared on the same sample, the comparison phase would favor models with more free parameters, because of their greater ability to conform to any particular sample.

Randomization testing

IRT tests models using randomization testing (Edgington 1980; Jensen 1991). Randomization testing generates a distribution that can be used to determine whether the score of the best of the competitors is significantly better than that of the current model. The distribution is generated by creating and testing large numbers of randomized data sets, and it reflects the scores that could be expected by chance alone.

For instance, consider applying randomization testing to the problem of binary classification (Figure 2). The data consist of a set of examples, each with a label indicating its class (Class) and values for one or more attributes (A, B, C, etc.). Several competitors are tested on the data, and the classification ability of each competitor is scored using an evaluation function.

To test whether the best competitor performs significantly better than the current model, we need to compare the score of the best competitor to the distribution of scores that would be expected by chance alone. As noted earlier, conventional statistical approaches to deriving this distribution are inappropriate because multiple, correlated models are tested. However, randomization testing can be used to create an accurate distribution.

Randomization testing creates randomized data sets; each randomized set is a copy of the actual data, but with class labels that have been reassigned.
This reassignment is done in such a way that the current model retains its classification accuracy, but so that any remaining consistent relationship between the attributes and the class is destroyed. On randomized data, the current model will misclassify different cases and yet still maintain its overall classification accuracy. If any of the competitor models classifies randomized data more accurately than the current model, it will be due to chance alone.
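The randomization test described above can be sketched as follows. The text does not say how labels are reassigned, so this sketch makes an assumption: labels are shuffled only among cases that receive the same prediction from the current model, which keeps the current model's accuracy unchanged while changing which cases it misclassifies. The function names, the use of accuracy as the evaluation function, and the p-value formula are illustrative choices, not taken from the source.

```python
import random
from collections import defaultdict


def accuracy(predictions, labels):
    """Fraction of cases where the prediction matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def reassign_labels(current_predictions, labels, rng):
    """Shuffle labels within each group of cases that share a current-model
    prediction (assumed scheme). The current model's accuracy is preserved."""
    groups = defaultdict(list)
    for index, prediction in enumerate(current_predictions):
        groups[prediction].append(index)
    randomized = list(labels)
    for indices in groups.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, label in zip(indices, shuffled):
            randomized[i] = label
    return randomized


def randomization_p_value(current_predictions, competitor_predictions,
                          labels, n_randomizations=1000, seed=0):
    """Estimate how often the best competitor score on randomized data
    matches or beats the best competitor score on the actual data."""
    rng = random.Random(seed)
    observed_best = max(accuracy(p, labels) for p in competitor_predictions)
    at_least_as_good = 0
    for _ in range(n_randomizations):
        randomized = reassign_labels(current_predictions, labels, rng)
        # Take the best score over ALL competitors on each randomized set,
        # so the distribution accounts for testing multiple models.
        best = max(accuracy(p, randomized) for p in competitor_predictions)
        if best >= observed_best:
            at_least_as_good += 1
    return (1 + at_least_as_good) / (1 + n_randomizations)
```

Taking the maximum over all competitors on every randomized data set is what lets the resulting distribution reflect the search over multiple, correlated models. A small returned value suggests that the best competitor's score is unlikely to arise by chance alone.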
Similar resources
Automated Software Warehouse Management
This paper proposes a knowledge-based approach to manage software warehouses. It is understood that knowledge acquisition is the bottleneck for intelligent systems of all kinds. Our research focuses on solutions for both theoretical and practical aspects of the bottleneck tasks through the proposed mechanisms of randomization, symbolic representation, and grammatical inference. Key-Words: Knowl...
The False Discovery Rate in Simultaneous Fisher and Adjusted Permutation Hypothesis Testing on Microarray Data
Background and Objectives: In recent years, new technologies have led to produce a large amount of data and in the field of biology, microarray technology has also dramatically developed. Meanwhile, the Fisher test is used to compare the control group with two or more experimental groups and also to detect the differentially expressed genes. In this study, the false discovery rate was investiga...
Designing an Ontology for Knowledge Discovery in Iran’s Vaccine
Ontology is a requirement engineering product and the key to knowledge discovery. It includes the terminology to describe a set of facts, assumptions, and relations with which the detailed meanings of vocabularies among communities can be determined. This is a qualitative content analysis research. This study has made use of ontology for the first time to discover the knowledge of vaccine in Ir...
Improving Classification Knowledge Using an Integrated Knowledge Discovery Approach
Attribute-oriented induction approach (AOA) has been developed for knowledge discovery in large relational databases. Several kinds of knowledge, such as characteristic rules and discrimination or classification rules, can be discovered. These rules may contain unnecessary conditions and/or unnecessary condition values. A Tuple-oriented approach (TOA) examines one tuple at a time since there are l...
Knowledge Discovery from Client-Server Databases
The subject of this paper is the implementation of knowledge discovery in databases. Specifically, we assess the requirements for interfacing tools to client-server database systems in view of the architecture of those systems and of "knowledge discovery processes". We introduce the concept of a query frontier of an exploratory process, and propose a strategy based on optimizing the current quer...
Publication date: 1991